Murray et al. Population Health Metrics 2011, 9:27
http://www.pophealthmetrics.com/content/9/1/27



POPULATION HEALTH METRICS 



RESEARCH Open Access 



Population Health Metrics Research Consortium 
gold standard verbal autopsy validation study: 
design, implementation, and development of 
analysis datasets 

Christopher JL Murray 1*, Alan D Lopez 2, Robert Black 3, Ramesh Ahuja 4, Said Mohd Ali 5, Abdullah Baqui 3,
Lalit Dandona 1,6, Emily Dantzer 7, Vinita Das 8, Usha Dhingra 3, Arup Dutta 3, Wafaie Fawzi 9, Abraham D Flaxman 1,
Sara Gomez 10, Bernardo Hernandez 10, Rohina Joshi 11, Henry Kalter 3, Aarti Kumar 4, Vishwajeet Kumar 4,
Rafael Lozano 1, Marilla Lucero 12, Saurabh Mehta 13, Bruce Neal 11, Summer Lockett Ohno 1, Rajendra Prasad 8,
Devarsetty Praveen 14, Zul Premji 15, Dolores Ramirez-Villalobos 10, Hazel Remolador 12, Ian Riley 2, Minerva Romero 10,
Mwanaidi Said, Diozele Sanvictores, Sunil Sazawal and Veronica Tallo



Abstract 

Background: Verbal autopsy methods are critically important for evaluating the leading causes of death in 
populations without adequate vital registration systems. With a myriad of analytical and data collection approaches, 
it is essential to create a high quality validation dataset from different populations to evaluate comparative method 
performance and make recommendations for future verbal autopsy implementation. This study was undertaken to 
compile a set of strictly defined gold standard deaths for which verbal autopsies were collected to validate the 
accuracy of different methods of verbal autopsy cause of death assignment. 

Methods: Data collection was implemented in six sites in four countries: Andhra Pradesh, India; Bohol, Philippines; 
Dar es Salaam, Tanzania; Mexico City, Mexico; Pemba Island, Tanzania; and Uttar Pradesh, India. The Population 
Health Metrics Research Consortium (PHMRC) developed stringent diagnostic criteria including laboratory, 
pathology, and medical imaging findings to identify gold standard deaths in health facilities as well as an 
enhanced verbal autopsy instrument based on World Health Organization (WHO) standards. A cause list was 
constructed based on the WHO Global Burden of Disease estimates of the leading causes of death, potential to 
identify unique signs and symptoms, and the likely existence of sufficient medical technology to ascertain gold 
standard cases. Blinded verbal autopsies were collected on all gold standard deaths. 

Results: Over 12,000 verbal autopsies on deaths with gold standard diagnoses were collected (7,836 adults, 2,075 
children, 1,629 neonates, and 1,002 stillbirths). Difficulties in finding sufficient cases to meet gold standard criteria 
as well as problems with misclassification for certain causes meant that the target list of causes for analysis was 
reduced to 34 for adults, 21 for children, and 10 for neonates, excluding stillbirths. To ensure strict independence 
for the validation of methods and assessment of comparative performance, 500 test-train datasets were created 
from the universe of cases, covering a range of cause-specific compositions. 

Conclusions: This unique, robust validation dataset will allow scholars to evaluate the performance of different 
verbal autopsy analytic methods as well as instrument design. This dataset can be used to inform the
implementation of verbal autopsies to more reliably ascertain cause of death in national health information
systems. 

Keywords: Verbal autopsy, VA, validation, Philippines, Tanzania, India, Mexico, gold standard, cause of death 



Background 

Verbal autopsy (VA) is a critically important tool to 
measure causes of death in populations without com- 
plete medical certification of causes of death. A variety 
of methods have been proposed for VA cause assign- 
ment [1,2], ranging from physician-certified verbal 
autopsy (PCVA) [3,4] to data-derived algorithms [5-7], 
various applications of Bayes' theorem [8-13], and direct 
statistical estimation of cause fractions [14]. New meth- 
ods to analyze VAs and attribute causes of death to 
them are now being developed [15-19], and it is likely 
that there will continue to be new methods and refine- 
ments. Given both the increasing demand for good 
cause of death information for the world's poorest 
populations and the expanding array of VA approaches, 
it is essential to be able to assess the performance of 
these options in a scientific and comparable manner. 

Several validation studies of VA cause assignment 
methods have been published [2,3,12,20-31]. Results of 
validation studies to date, however, have been chal- 
lenged on several grounds [32-34]. First, previously pub- 
lished validation studies compare the cause of death for 
individuals derived from verbal autopsy to the cause of 
death recorded in hospital records or that derived from 
independent review of hospital medical records. The 
quality of record keeping and the laboratory, medical 
imaging, and pathological services available in many 
developing country hospitals can be extremely poor. 
This is especially true in resource-poor remote areas 
where validation studies have been undertaken. As a 
result, many of these validation studies are actually com- 
parisons of two imperfect cause of death assignment 
approaches: low-quality hospital-assigned cause of death 
and the verbal autopsy. In the language of psychometrics,
most studies provide information on convergent validity
rather than criterion validity, that is, comparison to a true
gold standard [35]. Second, many studies
start with a community sample and then trace back
as many deaths to hospital records as possible. The 
resulting studies often yield small numbers for many 
causes, so that published results only cover the conver- 
gent validity of VA with hospital-assigned (or derived) 
causes of death for a limited number of causes of death. 
For many important causes of death such as liver cir- 
rhosis, chronic obstructive pulmonary disease (COPD), 
or specific sites of cancer, there is essentially no published
information on performance of VA. Third, validation
studies often do not provide details on the exact
items in the VA instrument, the training of interviewers,
the training of physicians for PCVA, the coding of death
certificates completed by physicians for PCVA, or the
protocol used to extract a cause of death from the hospital
records.

The Population Health Metrics Research Consortium 
(PHMRC) gold standard verbal autopsy validation study 
was initiated in 2005 to address these research limita- 
tions and to ensure that comparative assessments of VA 
performance were based on clinically reliable diagnoses. 
We designed the study as a multisite collaboration that 
aims to address some of the key limitations of previous 
validation studies and stimulate the development of new 
methods or refinements of existing methods. The pri- 
mary goal was to collect a dataset that would help pro- 
vide more definitive answers as to which VA approaches 
are more valid and to capture data in a standardized 
way. In this paper, we describe the design of the study, 
the criteria used to establish a gold standard (GS) cause 
of death, the implementation of fieldwork, and the crea- 
tion of standardized datasets for developing and testing 
new methods. 

Methods 

Data collection sites 

Gold standard VA data collection was implemented in 
six sites in four countries: Andhra Pradesh, India; Bohol, 
Philippines; Dar es Salaam, Tanzania; Mexico City, Mex- 
ico; Pemba Island, Tanzania; and Uttar Pradesh, India. 
Table 1 shows the age and sex distribution for the dece- 
dents represented in this study, as well as the national 
life expectancy. 

Research at the Andhra Pradesh, India, site was implemented
and coordinated through the George Institute
for Global Health, India, and was centered in the
capital city, Hyderabad, as well as the neighboring areas
of Ranga Reddy, Medak, and Nalgonda. Hyderabad is
100% urban with a population of roughly 3,830,000
inhabitants. The neighboring area Ranga Reddy has a
similar population size (3,575,000) and is roughly half
urban and half rural. The Medak and Nalgonda areas
are similar to each other: both are roughly 14% urban, with
populations of 3,248,000 in Nalgonda and 2,670,000 in
Medak.

The Bohol Island site was led by the Research Insti- 
tute for Tropical Medicine in Manila. Bohol is a tropical 
island province located in the Central Visayas of the 
Philippines, with 46 municipalities and Tagbilaran City. 






Table 1 The age and sex distribution of the decedents represented in the verbal autopsy sample and the national life
expectancy for the country according to the 2010 United Nations numbers

                                       National life   Decedents sampled
Site                                   expectancy      % Male  % Female  % Under age 5  % Ages 5-59  % Ages 60+
Andhra Pradesh, India                  64.2            59      41        28             55           17
Bohol, Philippines                     67.8            56      44        31             38           31
Dar es Salaam, Tanzania                55.4            48      52        44             41           15
Federal District and Morelos, Mexico   76.2            53      46        21             46           34
Pemba Island, Tanzania                 55.4            52      48        60             31           10
Uttar Pradesh, India                   64.2            58      42        24             58           18



Verbal autopsies were collected over the entire island, as 
well as a small proportion from Manila. According to 
the 2007 census, 1,230,000 people live in Bohol. Manila 
is urban, while Bohol is divided into roughly 46% urban 
and 54% rural. 

The research site in Dar es Salaam, Tanzania, was 
managed by collaborators at the Muhimbili University 
of Health and Allied Sciences. Verbal autopsies were 
collected from all over the city of Dar es Salaam, which 
has a population of roughly 2,487,000 people according 
to the 2002 census, with 94% of people living in urban 
areas and 6% living in rural areas. 

The Mexican study was coordinated by the National 
Institute of Public Health in the Federal District and the 
state of Morelos. According to the 2010 Census, 8.85 
million inhabitants live in the Federal District and 1.8 
million live in Morelos. Sixteen percent of the popula- 
tion of the state lives in rural areas [36]. 

Pemba Island, Tanzania, is the smaller of the two 
islands of the Zanzibar archipelago. The research there 
was coordinated through the Public Health Laboratory 
Ivo de Carneri as part of a collaboration between the 
Ministry of Health and Social Welfare and Johns Hop- 
kins University. Verbal autopsies were collected from all 
areas of the island. This island has a population of 
roughly 400,000 inhabitants. The island is 99% rural and 
1% semi-urban. 

Finally, the Uttar Pradesh site in India was led by collaborators
at the CSM Medical University (CSMMU, formerly
King George Medical College) in Lucknow.
Verbal autopsies were collected from a wide range of 
districts in the state of Uttar Pradesh: Ambedkar Nagar, 
Bahraich, Barabanki, Basti, Faizabad, Gonda, Hardoi, 
Lakhimpur, Lucknow, Rae Bareli, Sitapur, Sultanpur, 
and Unnao. Table 2 shows the population and urban 
percentage for each of these districts. 

Instrument 

The instrument development was based on the WHO 
standardized verbal autopsy instrument [37], which in 
turn was based in part on the work of Chandramohan
et al. (1994) for adult deaths and of Anker et al. (1999)
for neonatal and child deaths [38,39]. Separate questions 
were developed for neonatal deaths and stillbirths, chil- 
dren 1 month to 11 years, and adults 12 years and 
older. Experience gained from VA studies in Andhra 
Pradesh and China where the WHO instrument, or 
slight variants of it, had been applied was also consid- 
ered [40,41]. A committee drawn from the principal and 
associate investigators considered modifications based 
on published and unpublished experiences with the 
WHO instrument, including fieldwork conducted as 
part of a large VA study in Thailand. The final instru- 
ment was translated into the respective local languages, 
and then back-translated to English by a different trans- 
lator to ensure accuracy. 

The PHMRC instrument comprises a general
information module, an adult module, and a child and 
neonatal module. Skip patterns were integrated into the 
general information module to collect the age of the 
deceased and then direct interviewers to the correct 
module to administer.

Table 2 The population size in thousands and percent of population that is urban for the Uttar Pradesh, India
field sites, according to the 2001 Census of India

District          Population size (thousands)   % Urban
Ambedkar Nagar    2,026                         9
Bahraich          2,381                         10
Barabanki         2,673                         9
Basti             2,084                         6
Faizabad          2,088                         13
Gonda             2,765                         7
Hardoi            3,398                         12
Lakhimpur         889                           7
Lucknow           3,647                         64
Rae Bareli        2,872                         10
Sitapur           3,619                         12
Sultanpur         3,214                         4
Unnao             2,700                         15

In administering the WHO instrument, the interviewer must first determine the age
of the deceased and select the correct instrument to 
deliver, which results in the potential for more inter- 
viewer error and a less fluid interview. The general 
information module, which is administered in all verbal 
autopsies, collects items such as education of the dece- 
dent, household characteristics, and a household roster. 
The adult module collects a history of chronic condi- 
tions, symptoms of the deceased, women's health ques- 
tions if the decedent is female, alcohol and tobacco use, 
and injury information; it also transcribes any available 
medical record and death certificate information. The 
child and neonatal module first asks background ques- 
tions on information such as whether the mother is still 
alive, where the deceased was born, the size of the dece- 
dent at birth, and the delivery date. The questionnaire 
then ascertains whether the decedent was a stillbirth 
and, if so, collects symptom questions, such as signs of 
injury. If not, the questionnaire collects more general 
information such as the age of the baby or child when 
they became ill and the age at death. If the decedent is
under 28 days (inclusive of stillbirths), a maternal history
is collected. In addition, if the decedent is under 28
days and was born alive, a full set of neonatal symptom
questions is collected. If the decedent is between 28
days and 11 years, infant and child symptom questions
are asked. All available health records and death certificates
are transcribed for both neonatal and child deaths.
Finally, for all ages, the open narrative section was 
moved to the end of the interview, after the structured 
questions. This was done to ensure that in future work, 
we could remove the open-ended items without concern 
that the results collected in this study were a function 
of the open-ended items coming prior to structured 
content. 
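
The skip logic described above can be summarized in a short sketch. This is only an illustration of the routing rules as stated in the text, not the instrument itself: the class name, field names, and module labels are hypothetical, and the age thresholds (under 28 days, 12 years and older) are the only values taken from the study.

```python
from __future__ import annotations

from dataclasses import dataclass

NEONATAL_CUTOFF_DAYS = 28   # under 28 days: neonatal symptom questions
ADULT_START_YEARS = 12      # the PHMRC adult module begins at age 12


@dataclass
class Decedent:
    # Hypothetical record layout used only for this sketch.
    age_years: int = 0      # completed years of age at death
    age_days: int = 0       # days lived; consulted only when age_years == 0
    stillbirth: bool = False


def select_modules(d: Decedent) -> list[str]:
    """Return the questionnaire sections to administer, in interview order."""
    modules = ["general_information"]
    if d.age_years >= ADULT_START_YEARS:
        modules.append("adult")
    else:
        modules.append("child_neonatal_background")
        if d.stillbirth:
            # Stillbirths get symptom questions plus a maternal history.
            modules += ["stillbirth_symptoms", "maternal_history"]
        elif d.age_years == 0 and d.age_days < NEONATAL_CUTOFF_DAYS:
            # Live-born neonates: maternal history and neonatal symptoms.
            modules += ["maternal_history", "neonatal_symptoms"]
        else:
            modules.append("infant_child_symptoms")
    # The open narrative comes last, after all structured questions.
    modules.append("open_narrative")
    return modules
```

For example, `select_modules(Decedent(age_years=40))` yields the general information module, the adult module, and the open narrative, in that order.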

In addition to the structural changes, there are impor- 
tant differences between the PHMRC instrument and 
the WHO instrument. First, the WHO adult module is
administered for ages 15 and above, while the PHMRC
adult module begins at age 12. This expansion of the
ages included in the adult module ensures that conditions
that present clinically in this age range, such as maternal
mortality in 12- to 14-year-olds, are captured through this instrument.
Second, a substantial portion of the questions were 
reworded to ensure clarity. Medical terminology was 
converted to easily understandable descriptions to target 
a lay population. For example, "Did s/he have abdominal 
distension?" was reworded to "Did [NAME] have a more 
than usual protruding belly?" Information was also 
added for precision, or removed to ensure only the most 
diagnostically relevant information was collected. Similarly,
we added or dropped entire questions to capture
the most essential information, while reducing the duration
of the interview as much as possible. One common
question type dropped from the instrument was the
duration of certain symptoms. For example, the
PHMRC instrument asks whether adults had developed
a lump in the neck, armpit, breast, or groin but dropped
the follow-up question "For how long did s/he have the
lumps?" as the presence of the symptom alone was the
most important information. Another common question
type dropped from the WHO instrument concerned
treatment that had been received by the decedent, as
this was less important in informing the cause of
death. Finally, the PHMRC instrument did not include 
questions about chronic conditions in children, such as 
cancer, tuberculosis, and diabetes. Additional file 1 illus- 
trates the content questions, such as symptoms experi- 
enced by the decedent that were added or dropped 
when converted from the WHO instrument to the 
PHMRC instrument. The small wording changes are not 
included in this additional file, though the full PHMRC 
instrument is included in Additional file 2 (general mod- 
ule), Additional file 3 (adults), and Additional file 4 
(children and neonates) for reference. 

Cause list 

A key challenge for the study was to identify the cause 
list for each of the three age groups for which we would 
seek to collect a sample of gold standard deaths. Our 
selection of the target cause list was based on considera- 
tion of the WHO estimates of the leading causes of 
death in the developing world in each age group, those 
causes for which verbal autopsy might be able to func- 
tion adequately because unique signs and symptoms 
could potentially be collected in an interview, and the 
potential to find, in the six sites, deaths with sufficient 
laboratory, medical imaging, and pathological detail in 
order that a gold standard cause of death assignment 
could be made. The cause lists were also designed so 
that they were mutually exclusive and collectively 
exhaustive. The target cause list for adults, children, and 
neonates included 53, 27, and 13 GS causes, respec- 
tively, plus stillbirths (for a complete list of causes, see 
Additional file 5). These cause lists are much longer 
than for any previously undertaken VA validation study. 
In fact, nearly all previous VA validation studies have 
started with a community or convenience sample of 
deaths and then ascertained cause in hospital records 
rather than seeking to collect data on a list of causes by 
design. 

Gold standard criteria 

A critical component of the study was the development, 
for each cause, of clear criteria that had to be fulfilled 
for a death to be assigned as a GS cause of death. 
Depending on the cause of death, these criteria included 
clinical endpoints, laboratory findings, medical imaging,
and pathology. Additional file 6 (adults) and Additional
file 7 (children and neonates) provide the gold standard 
criteria for each cause. These gold standard criteria were 
developed by a committee of physicians involved in the 
study and underwent multiple cycles of group review. 

Preliminary review of hospital records in the sites 
indicated it would be very difficult to identify any deaths 
for some causes that would meet the strict gold stan- 
dard criteria. In order to ensure that as many potentially 
eligible deaths in each site as possible were collected for 
the study, a less strict but nevertheless detailed level 2
set of criteria was also developed (see Additional files 6
and 7). In some cases, these level 2 criteria were further 
disaggregated into level 2A and level 2B. By way of 
example, the criteria for determining a death as being 
due to adult breast cancer, adult acute myocardial 
infarction, child pneumonia, and neonatal birth asphyxia 
are shown in Table 3. 

By recording the level of diagnosis for each death, we 
are able to test whether the assessment of performance 
for any method is affected by the level of cause of death 
assignment according to our criteria. 
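
To make the structure of such criteria concrete, the level 1 rule for child pneumonia in Table 3 can be written as a simple check. This is a minimal sketch under stated assumptions, not the procedure used by the study physicians: the function and parameter names are hypothetical, and each bullet in the table is counted as one qualifying sign.

```python
from typing import Optional


def child_pneumonia_level1(
    chest_xray_positive: bool,
    resp_rate_per_min: Optional[float] = None,
    severe_chest_indrawing: bool = False,
    abnormal_breath_sounds: bool = False,
    rectal_temp_c: Optional[float] = None,
    oral_axillary_temp_c: Optional[float] = None,
) -> bool:
    """Level 1: qualifying chest X-ray plus two or more of the listed signs."""
    signs = [
        # Respiratory rate above 70 per minute
        resp_rate_per_min is not None and resp_rate_per_min > 70,
        severe_chest_indrawing,
        abnormal_breath_sounds,
        # Rectal temperature above 38 C or below 36 C
        rectal_temp_c is not None
        and (rectal_temp_c > 38.0 or rectal_temp_c < 36.0),
        # Oral or axillary temperature above 37.5 C or below 35.5 C
        oral_axillary_temp_c is not None
        and (oral_axillary_temp_c > 37.5 or oral_axillary_temp_c < 35.5),
    ]
    return bool(chest_xray_positive and sum(signs) >= 2)
```

A record with a qualifying chest X-ray, a respiratory rate of 80 per minute, and severe chest indrawing would satisfy this rule; the same signs without the X-ray finding would not.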

Data collection 

Identification of gold standard deaths 

As described above, a stringent set of diagnostic criteria 
for each cause of death was developed by a team of 
study physicians before fieldwork began. Each site then 
enrolled local health facilities at which medical records 
would be reviewed. Consortium members led a two-day 
training at each of the sites to train the reviewers in the 
gold standard definitions, the protocols for identifying 
cases meeting these criteria, and the procedure for 
extracting the pertinent medical information. Each 
reviewer was provided a pocket guide detailing the 
necessary criteria for each gold standard cause of death. 
The medical information from qualifying records was 
extracted using a standard medical data extraction form 
(MDEF, see Additional file 8), which the study team 
developed. Once eligible records were extracted, a local 
physician reviewed the medical information and deter- 
mined the gold standard level of the particular case 
according to the diagnostic criteria outlined for each 
level for each cause. The following information details 
the specific protocol followed by each research site. 

In Andhra Pradesh, four hospitals were recruited for 
the study. Three are government hospitals - Gandhi 
Hospital, Osmania General Hospital, and Chest Hospital 
- and one is a private hospital, CARE Foundation. There 
was 24-hour surveillance at the hospitals and all patients 
were enrolled with their addresses. Study supervisors 
collected information on all deceased patients from all 
wards, and clinicians involved in the study then 
reviewed the case sheets to select those that conformed
to the gold standard criteria (levels 1, 2A, and 2B). The
medical information from all qualifying cases selected by 
the clinicians was extracted and sent to the George 
Institute Hyderabad office for enrollment in the verbal 
autopsy study. 

In Bohol, the majority of deaths were reviewed at the 
Bohol Regional Hospital. This facility is the referral hos- 
pital for Bohol Province with the highest available stan- 
dards of clinical investigation and hence diagnosis. 
Three nurses monitored all deaths in the hospital. They 
ensured that all reports of investigations (imaging and 
laboratory) were located and attached to the charts. In 
addition, to augment the number of deaths collected, 
467 deaths were recruited from two hospitals in Manila: 
the Veterans Memorial Medical Center and the Rizal 
Medical Center. In all locations, the nurses summarized 
the case notes, including reports of investigations, onto 
the medical data extraction forms. MDEFs were first 
reviewed by two study physicians who assigned cause of 
death and decided by diagnosis and GS level which VAs 
should not be collected. Deaths were reviewed as soon 
as possible after the death. 

At the Dar es Salaam site, five health facilities were 
used as recruitment sites. These were Mwananyamala 
Hospital, Temeke Hospital, Muhimbili National Hospi- 
tal, Ocean Road Cancer Institute, and Hindu Mandal 
Hospital. Mwananyamala and Temeke are both district 
hospitals, each of which records roughly 1,500 deaths 
per year. Ocean Road Cancer Institute is the only cancer 
treatment facility in Tanzania and was an important 
source for causes such as cervical cancer, esophageal 
cancer, breast cancer, leukemia, prostate cancer, and 
lymphomas. Muhimbili National Hospital is a referral 
and teaching hospital with a higher mortality rate than 
the other enrolled facilities. Hindu Mandal Hospital is a 
private hospital in the heart of Dar es Salaam. It has a 
well-established HIV/AIDS clinic and commonly 
receives noncommunicable disease cases. At each loca- 
tion, a nurse affiliated with the study reviewed medical 
records to identify qualifying cases. The cases identified 
by the nurses were reviewed by physicians, who filled 
out the MDEFs with the gold standard levels for the 
cases that were eligible for enrollment. The nurses 
spoke with family members of the deceased if present at 
the hospital to enroll them in the study, collect their 
consent, and obtain mapping information and directions 
for a verbal autopsy interview. 

In Mexico, after obtaining authorization to work in 
each medical unit, a group of six trained physicians 
reviewed the medical records of cases (and when avail- 
able the reports from autopsies) that could be included 
in the study, filled out an extraction form for each case, and
classified them as levels 1, 2, or 3 according to the gold 
standard criteria proposed by the PHMRC.

Table 3 Examples of gold standard criteria for adult breast cancer, adult acute myocardial infarction, child
pneumonia, and neonatal birth asphyxia

Adult breast cancer

Level 1   One of the following:
          • Operative specimen with histological confirmation
          • Biopsy/fine needle aspiration cytology

Level 2A  Both of the following:
          • Mammography diagnosis
          • Imaging evidence of metastases in bone, lung, etc. based on CT scan/MRI/X-rays

Level 2B  Patient under treatment from a recognized cancer hospital or cancer unit for breast cancer in cases
          where the basis for the initial diagnosis is no longer available.

Adult acute myocardial infarction

Level 1   Evidence of acute MI within three months preceding death based upon one or more of the following:
          • Cardiac perfusion scan
          • ECG changes
          • Documented history of CABG or PTCA or stenting
          • Coronary angiography
          • Enzyme changes (any troponin elevation or CK-MB isoenzyme elevation >2 times the upper limit of
            normal) in the context of myocardial ischemia

Level 2A  Clinical evidence of the following:
          • Sudden death within six hours of the onset of characteristic shock and chest pain when the case
            has been witnessed by a physician

Child pneumonia

Level 1   Chest X-ray showing primary end-point consolidation, pleural effusion or other
          consolidation/infiltration, plus two or more of the following:
          • Respiratory rate >70/minute
          • Severe lower chest indrawing
          • Abnormal breath sounds (i.e., grunting, decreased breath sounds, crepitations)
          • Rectal temperature >38°C or <36°C
          • Oral or axillary temperature >37.5°C or <35.5°C

Neonatal birth asphyxia

Level 1   Each of the following:
          • Failure both to breathe spontaneously and to cry at birth
          • No major congenital abnormality
          • Not a stillbirth (one or more signs of life at birth like pulse or movement)
          Plus one of the following in the 24 hours after birth:
          • Not feeding
          • Hypotonia
          • Seizures
          • Needed and failed resuscitation at birth

Level 1 is the most stringent set of criteria; level 2A or 2B criteria were also collected for some causes.

Only cases classified as levels 1 and 2 were considered eligible for
the study. The original design considered the inclusion 
of only one to three large hospitals in Mexico City, but 
due to the difficulty of completing the quota of gold 
standard cases, hospitals from the health service net- 
work of the Federal District government and from the 
Ministry of Health of the state of Morelos were 
included. The data were collected from 36 public hospi- 
tals: 33 from the Federal District and three from 
Morelos. 



In Pemba, there are four major government hospitals 
on the island, though most facilities do not have a certi- 
fied medical doctor present and are managed by medical 
assistants and nurses. Surveillance systems were put in 
place in all four hospitals to identify deaths and to clas- 
sify them into GS categories. The hospital supervisor 
recorded complete identification information upon 
admission of each patient, and the attending physician or
medical assistant confirmed the admission diagnosis.
Hospital supervisors ensured that the signs and
symptoms experienced by the patient were recorded and
that a mortality form with the cause(s) of death was 
filled out by the attending physician in the event of a 
death. All forms were sent back to the field headquar- 
ters for data entry. A computer algorithm was run to 
identify cases meeting GS criteria, and all GS cases were 
recorded in a database. A computer listing was prepared 
with identifier information to schedule the VA 
interviews. 

In Uttar Pradesh, the gold standard deaths were 
enrolled at CSMMU, Lucknow, which is a tertiary care 
government facility with patient inflow from all over 
Uttar Pradesh and bordering states, including districts in 
the neighboring country of Nepal. The catchment area 
spreads over a radius of more than 500 km, and
about 85% of cases come from 13 districts surrounding
Lucknow. There was 24-hour surveillance at facilities 
and all patients were enrolled with an address. When a 
death occurred, the project medical officer reviewed the 
patient case sheet in consultation with the resident doc- 
tor in order to assess the GS levels against standard 
criteria.

VA interview

Once enrolled, the VA interviewers at each site attended 
a training session led by consortium members using 
standardized materials and an interviewer's manual. The 
training manuals provided information on the study 
background, the roles and responsibilities of the VA 
interviewer, background on how VA cases were selected, 
instructions for administering the questionnaire, and 
information on every question in the instrument. The 
manual provided guidance on how to handle an array of 
questions or concerns, tips for building rapport with the 
respondents, and probing as needed to collect reliable 
information. 

Following the training, VA assignments were given to 
interviewers blinded to the medical information or 
cause of death of the decedent along with directions or 
map cues to the households. In some sites the
families were contacted in advance to schedule an 
appointment, though this decision was left to the sites' 
discretion. All interviews were collected after a culturally 
appropriate grieving period had passed. The minimum
grieving period was six days in Bohol and the maximum
was six months in Mexico (as required by the
ethics boards at the hospitals). The maximum amount 
of time post-death that an interview was collected was 
eight months in the Mexico site. 

The rate of interview refusals varied by site from 1.8% 
to 9.5%. For those that consented to a verbal autopsy, 
the instrument was administered on paper in the field, 
and returned to the field headquarters for double data 
entry. Interviews lasted an average of 45 minutes across 
all of the sites. 
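
The text does not describe how the two keyed copies from double data entry were reconciled. As an illustration only, and assuming each entry is held as a mapping from question identifier to keyed value (a hypothetical layout), a field-by-field comparison might look like the following sketch.

```python
def double_entry_discrepancies(first: dict, second: dict) -> dict:
    """Return {field: (first_value, second_value)} for fields that disagree.

    A field present in only one entry is reported with None for the
    entry that lacks it.
    """
    fields = sorted(set(first) | set(second))
    return {
        field: (first.get(field), second.get(field))
        for field in fields
        if first.get(field) != second.get(field)
    }
```

Any non-empty result would then be resolved by checking the paper form, which remains the source document.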



Quality control of fieldwork and data entry 

To ensure that the highest quality data were collected, quality
control checks were performed both at the individual
site level and at the Institute for Health Metrics
and Evaluation (IHME), where all data were transmitted 
through a secured password-protected site for analysis. 

In all sites, supervisors were trained in the protocols 
for monitoring quality control at the site level. Supervi- 
sors were instructed to observe VA interviewers in the 
field during the early stage of data collection to ensure 
they were conducted properly and to provide guidance. 
Supervisors additionally checked every VA form col- 
lected throughout the study to ensure that it was filled 
out consistently and correctly. If issues were identified 
by the supervisor, a reinterview was conducted as 
needed. The field interviewers had periodic meetings 
with their supervisors to discuss performance, progress, 
and challenges. Supervisors at most sites additionally 
reinterviewed a portion of the verbal autopsies to spot 
check the quality of the information collected. 

At IHME, we systematically evaluated all datasets
electronically for numerous types of quality issues using a
comprehensive set of code-based checks. First, we reviewed the dataset
for missing values and for incorrect skip patterns that
resulted in specific questions having been filled in or left
blank erroneously. The dataset was also evaluated to
determine if any of the observed values fell outside of 
expected ranges. For example, if the response for a neo- 
natal symptom duration was greater than 28 days (the 
cutoff for classification as a neonatal death), this value 
was flagged. Next, if the dataset was submitted in multi- 
ple sections, we examined the final comprehensive data- 
base for any technical issues that may have occurred in 
merging the individual files. Finally, we merged the data- 
set with the gold standard medical record information, 
which was separately transmitted to IHME by the site 
coordinator. We examined the observations for consis- 
tency between the two sources of information, such as 
the sex of the decedent as reported in the medical 
record and as reported by the verbal autopsy respondent.
Any issues identified through this stringent
checking process were compiled into a report and sent
to the site for review. Site coordinators were asked to
speak with the interview staff and rectify any correctable
issues such as data entry mistakes.
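
The checks described in this section might be sketched as follows. This is a minimal illustration, not the code used at IHME: the record layout, the field names (`sex`, `age_module`, `had_rash`, `rash_location`, `symptom_durations_days`), and the specific skip rule are all hypothetical. Only the 28-day neonatal cutoff and the sex consistency check come from the text.

```python
def flag_quality_issues(va_record, medical_record):
    """Return a list of quality flags for one verbal autopsy record.

    Both arguments are plain dicts; all field names are hypothetical.
    """
    flags = []

    # 1. Missing values in fields that should always be filled in.
    for field in ("sex", "age_module"):
        if va_record.get(field) in (None, ""):
            flags.append(f"missing value: {field}")

    # 2. Skip pattern: a follow-up item was answered although the
    #    screening question was answered "no".
    if va_record.get("had_rash") == "no" and va_record.get("rash_location"):
        flags.append("skip pattern: rash_location filled but had_rash is 'no'")

    # 3. Range check: a neonatal symptom cannot last longer than 28 days.
    if va_record.get("age_module") == "neonate":
        durations = va_record.get("symptom_durations_days", {})
        for symptom, days in durations.items():
            if days is not None and days > 28:
                flags.append(f"out of range: {symptom} lasted {days} days")

    # 4. Consistency between the VA and the gold standard medical record.
    va_sex, gs_sex = va_record.get("sex"), medical_record.get("sex")
    if va_sex and gs_sex and va_sex != gs_sex:
        flags.append("inconsistent: sex differs between VA and medical record")

    return flags


example_va = {
    "sex": "female",
    "age_module": "neonate",
    "had_rash": "no",
    "rash_location": "trunk",
    "symptom_durations_days": {"fever": 35, "cough": 3},
}
example_flags = flag_quality_issues(example_va, {"sex": "male"})
```

In this sketch the flags for all records would be collected into the report that is sent back to the site coordinator.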

Generation of dichotomized variables 

In addition to the full dataset as it was collected, we 
have also created a series of dichotomous variables from 
each of the polytomous (categorical) and continuous 
(duration) variables. Some analytical methods can only
use dichotomized variables, so creating the
dichotomous variables increases the information available
to these types of empirical methods. For each
continuous duration item, we identified a cutoff marking
either a short or a long duration, depending on the item.
For example, a duration of 8.8 days marks a long duration
of fever: if a VA reports a fever of 10 days, the death is
considered to have the symptom of "having a long fever."
We set the cutoff at two median absolute deviations above
the median of the mean durations across causes (MAD
estimator). The MAD estimator can be used as a robust
measure of the standard deviation and is especially useful
in cases where extremely long durations may be
reported, which would bias measures such as the standard
deviation. Additional file 9 shows the cutoffs for
each item developed in this way. For polytomous vari- 
ables, we examined the pattern of the endorsement rates 
across causes and mapped the categories into two, thus 
creating a dichotomous version of the variable. For 
example, we judged that there was a stronger signal pro- 
duced by combining moderate and severe fevers. Addi- 
tional file 10 shows the mapping of each response 
category into dichotomous variables. Based on the data 
collected, some polytomous variables appeared to have 
little or no information content and were not mapped 
into a dichotomous form. These low information con- 
tent items are shown in Additional file 11. This exercise 
was undertaken for neonatal, child, and adult modules 
separately. 
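
The cutoff rule for duration items can be illustrated with a short sketch. It assumes the unscaled median absolute deviation (the text does not say whether a scaling constant was applied) and treats a duration equal to the cutoff as "long"; both are assumptions, and the function names and toy durations are hypothetical.

```python
from statistics import mean, median


def long_duration_cutoff(durations_by_cause, n_mads=2.0):
    """Median of the cause-specific mean durations plus n_mads * MAD.

    durations_by_cause maps a cause name to a list of reported durations
    (in days) for one symptom. The MAD here is unscaled (an assumption).
    """
    cause_means = [mean(values) for values in durations_by_cause.values() if values]
    center = median(cause_means)
    mad = median([abs(m - center) for m in cause_means])
    return center + n_mads * mad


def dichotomize(duration_days, cutoff):
    """Return 1 if the reported duration counts as 'long', else 0."""
    return int(duration_days is not None and duration_days >= cutoff)


# Toy data: cause-specific mean durations are 4, 6, and 10 days.
toy_durations = {"cause_a": [3, 5], "cause_b": [6, 6], "cause_c": [8, 12]}
toy_cutoff = long_duration_cutoff(toy_durations)  # 6 + 2 * 2 = 10
```

With these toy values the median of the means is 6 and the MAD is 2, so the cutoff is 10 days; a reported duration of 12 days would be coded 1 and a duration of 3 days would be coded 0.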

Inclusion of health care experience 

There has long been concern that the performance of a 
VA instrument and the associated analytical method for 
assigning cause could be different for deaths where the 
decedent died in a hospital or had made extensive use 
of health services prior to death, compared to deaths 
with no health care experience (HCE). As an attempt to 
examine how VA may work in communities with lim- 
ited or no access to health care services, Murray et al. 
[12] studied how PCVA and the Symptom Pattern 
Method performed when all items referring to use of 
health services such as "Have you ever been diagnosed 
with..." or hospital records or death certificates were 
excluded from the analysis. They showed that, in China,
household recall of health care experience or possession of
medical records, as recorded in the VA interview, had a
profound effect on both the concordance for PCVA and the
performance of the Symptom Pattern Method.

Given this empirical finding, we believe that
excluding household recall of health care
experience likely provides a more realistic assessment of
how VA performs in communities without access to
health services. As such, we have created two versions
of the datasets developed above: one version with all
variables and one version excluding recall of health care
and medical records. Specifically, the "without HCE" dataset
excludes the following information. First, a series of
questions asked if the deceased had any specified conditions,
which would likely indicate that a health care provider
had diagnosed the individual. The question was asked for
each of the following conditions: "Did decedent have
[asthma, hypertension, obesity, stroke, tuberculosis, AIDS, arthritis,
cancer, COPD, dementia, depression, diabetes, epilepsy,
heart disease]?" Second, if any medical records were
available, the interviewer was asked to provide a transcription
of the last note on the medical record. Third,
if a death certificate was available, the interviewer was
asked to record the immediate cause of death, first
underlying cause, second underlying cause, third underlying
cause, and contributing causes from the death certificate.
Finally, at the end of the questionnaire, an
open-ended section was provided to collect any comments
from the interviewer, as well as to ask the
respondent "to summarize, or tell us in your own
words, any additional information about the illness and/or
death of your loved one?" Excluding this entire section
removes not only open narrative recall of HCE but
also, in the case of PCVA, any other information
on timing and sequencing of signs and symptoms
that might be conveyed in this section.
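
As a rough sketch of how the two dataset versions relate, the "without HCE" version can be thought of as the full record with the four groups of items listed above removed. The column-name prefixes below are hypothetical; the actual variable names in the PHMRC files are not given in the text.

```python
# Hypothetical column-name prefixes for the four groups of excluded items.
HCE_PREFIXES = (
    "diagnosed_",          # "Did decedent have [asthma, ...]?" items
    "medical_record_",     # transcription of the last medical record note
    "death_certificate_",  # causes copied from the death certificate
    "open_narrative",      # open-ended section and interviewer comments
)


def drop_hce_items(record):
    """Return a copy of one VA record without health care experience items."""
    return {
        name: value
        for name, value in record.items()
        if not name.startswith(HCE_PREFIXES)
    }


full_record = {
    "fever": 1,
    "cough": 0,
    "diagnosed_diabetes": 1,
    "death_certificate_immediate": "sepsis",
    "open_narrative": "free text from the respondent",
}
without_hce_record = drop_hce_items(full_record)
```

Both versions keep the same deaths and the same gold standard cause; only the set of available items differs.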

Processing free text for use in empirical methods 

The structured instrument includes various open text
items. First, some questions in the instrument ask the
respondent to choose from a list of specified response
options. For example, "Where was the rash located?"
has the following response options: face, trunk, extremities,
everywhere, or "other (specify: ____)." If the
response is not one of the listed options, the respondent
is asked to fill in the location of the rash as the "other"
response. The questions that include an "other" free text
response option are as follows: "Where was the rash
located?"; "Where was the pain located?"; "Which were
the limbs or body parts paralyzed?"; "What kind of
tobacco did [NAME] use?"; "Did [NAME] suffer from
an injury or accident such as a ____?"; "Where was the
deceased born?"; "What were the abnormalities?" in
reference to any abnormalities at time of delivery;
"Where did the deceased die?"; "What was the color of
the liquor when the water broke?" in reference to labor;
"Where did the delivery occur?"; and "Who delivered
the baby?" In the questions that collect information
about a health facility or midwife, free text responses
collected the name and address of the place or person.
In addition to these free text items, if any medical
records or death certificates were available, the interviewer
was asked to transcribe the information from the
records as free text. Finally, at the end of each interview,
the response to the open narrative question "Summarize, or tell us in
your own words, any additional information about the
illness and/or death of your loved one?" (as described
above) was collected in addition to any notes from the
interviewer.

Open text could in theory be highly informative, espe- 
cially household recall of HCE and an interviewer's 
direct recording of death records or hospital records 
kept by the household. These observations are likely to 
be available in populations with some access to health 
care services. To make this information available to 
automated methods, we processed open text in the fol- 
lowing steps. First, all free text was compiled into a 
database and a dictionary was created to map all similar 
words to the same stem word. For example, the terms 
AMI, myocardial infarction syndrome, acute myocardial 
infarction, ISHD, MI, coronary heart disease, CHD, 
IHD, MCI, and MYIN would all be mapped by the dic- 
tionary into the same variable ("IHD: Acute Myocardial 
Infarction"). Next, a program called README [42] 
extracts each individual variable and assigns a frequency 
count for the number of times it appears in the entire 
free text database. Variables that are not deemed to be 
diagnostically relevant or that are very low in frequency 
are then dropped from the dataset. The final product is 
a condensed dictionary of medically important terms 
consisting of 106 variables for adults, 90 for children, 
and 39 for neonates. These terms are added as addi- 
tional binary symptoms (present or not present) in the 
VA database. If any of the terms appear in the free text 
for a particular death, it is counted as a positive endor- 
sement for that symptom. These symptoms are not used 
in the "without" HCE dataset. Additional file 12 provides 
the comprehensive dictionary that was developed. 
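
The dictionary step can be illustrated with a small sketch. It is not the README program cited above and does not reproduce the dictionary in Additional file 12; the variable name, the handful of synonyms, and the whole-word matching rule are illustrative assumptions.

```python
import re

# Illustrative subset of a synonym dictionary; the variable name is hypothetical.
IHD = "ihd_acute_myocardial_infarction"
TERM_TO_VARIABLE = {
    "ami": IHD,
    "acute myocardial infarction": IHD,
    "mi": IHD,
    "coronary heart disease": IHD,
    "chd": IHD,
    "ihd": IHD,
}


def endorsed_variables(free_text, dictionary=TERM_TO_VARIABLE):
    """Return the set of dictionary variables mentioned in one free text."""
    text = free_text.lower()
    endorsed = set()
    for term, variable in dictionary.items():
        # Word boundaries keep short terms such as "mi" from matching
        # inside unrelated words such as "family".
        if re.search(r"\b" + re.escape(term) + r"\b", text):
            endorsed.add(variable)
    return endorsed


def add_text_symptoms(record, free_text, variables):
    """Add one binary symptom (1 = present, 0 = not present) per variable."""
    endorsed = endorsed_variables(free_text)
    for variable in variables:
        record[variable] = int(variable in endorsed)
    return record
```

For example, `endorsed_variables("Doctor said he had an MI two years ago")` would return the single IHD variable, while a narrative that mentions only fever would endorse nothing.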

Analysis datasets 

For empirical VA methods that must be developed using 
the pattern of responses observed in a dataset, validation 
needs to be undertaken on a set of deaths that were not 
included in the development of the method. This is the 
concept of a training dataset distinct from a test dataset. 
Further, as recommended in Murray et al. [15], it is
important to have test datasets with widely varying 
cause-specific mortality fractions (CSMFs) so that a VA 
method does not by chance appear to be better than 
another because of the specific CSMF composition in 
the training set. To facilitate strict comparability, we 
have created 500 train-test dataset pairs. Each pair was 
created by first splitting the data randomly (without 
replacement) into 75%/25% training and test datasets, 
cause by cause, and then resampling the data in the test 
dataset (with replacement) to have 7,836 adult, 2,075 
child, 1,629 neonatal, and 1,002 stillbirth deaths, match- 
ing a cause composition drawn from an uninformative 
Dirichlet distribution (Figure 1). In other words, each 
test dataset has been resampled to have a different 
CSMF composition. Because the CSMF compositions
have been drawn from an uninformative Dirichlet,
across the 500 test datasets, there are cases where any
given cause has a cause fraction near zero and cause
fractions as high as 20% or more. By the nature of this
sampling strategy, there is no correlation between the
CSMF composition of the training and test dataset pairs.

Figure 1 The process of generating 500 test and training
datasets (done separately for each cause of death).
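
One way to generate a single train-test pair along these lines is sketched below. It is an illustration under stated assumptions, not the authors' implementation: the uninformative Dirichlet is taken to be Dirichlet(1, ..., 1), sampled by normalizing Gamma(1, 1) draws, and the test set is filled by drawing a cause for each slot according to the sampled CSMF and then drawing a death of that cause with replacement. The function name and data layout are hypothetical.

```python
import random
from collections import defaultdict


def make_train_test_pair(deaths, n_test, rng, train_fraction=0.75):
    """Build one train-test pair from a list of (death_id, cause) tuples."""
    by_cause = defaultdict(list)
    for death in deaths:
        by_cause[death[1]].append(death)

    # Step 1: split each cause 75%/25% without replacement.
    train, test_pool = [], {}
    for cause, members in by_cause.items():
        members = members[:]
        rng.shuffle(members)
        n_train = int(round(train_fraction * len(members)))
        train.extend(members[:n_train])
        test_pool[cause] = members[n_train:]

    # Step 2: draw a CSMF composition from Dirichlet(1, ..., 1) by
    # normalizing independent Gamma(1, 1) draws (assumed parameterization).
    causes = [cause for cause in test_pool if test_pool[cause]]
    gammas = [rng.gammavariate(1.0, 1.0) for _ in causes]
    total = sum(gammas)
    csmf = {cause: g / total for cause, g in zip(causes, gammas)}

    # Step 3: resample the test pool with replacement to follow that CSMF.
    drawn = rng.choices(causes, weights=[csmf[c] for c in causes], k=n_test)
    test = [rng.choice(test_pool[cause]) for cause in drawn]
    return train, test, csmf


toy_deaths = (
    [(i, "A") for i in range(40)]
    + [(40 + i, "B") for i in range(40)]
    + [(80 + i, "C") for i in range(20)]
)
toy_train, toy_test, toy_csmf = make_train_test_pair(
    toy_deaths, n_test=200, rng=random.Random(0)
)
```

Repeating the call 500 times with independent random draws would give 500 pairs whose test compositions are unrelated to the training compositions, because the training half is never resampled.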

Shortened cause lists 

In order to have an efficient cause list for the analysis,
we reduced it in two steps, as illustrated in Table 4.
From the original gold standard target cause list, we
received deaths from the sites for 53 diseases in adults,
27 in children, and 13 in neonates, excluding stillbirths.
The first step was to select only those causes with 15 or
more deaths (see Additional file 5 for a detailed mapping),
which reduced the list to
46 adult causes, 22 child causes, and 12 neonate causes,
excluding stillbirths. For instance, pelvic inflammatory
disease, uterine cancer, and dementia in adults; AIDS
with tuberculosis in children; and meningitis in neonates
had fewer than 15 deaths each. We also eliminated pertussis
in children and neonatal tetanus because no pertussis
and only four neonatal tetanus deaths were
gathered. These deaths were assigned to one of the
remaining categories, such as the residual categories
"other defined cancers" or "other childhood infectious
diseases." In the next step, we explored the frequency
with which one cause was erroneously classified as
another cause in the analysis. For example, deaths due
to maternal hemorrhage were often assigned to anemia
in the analysis and vice versa. Similarly, all types of diabetes
in adults (diabetes with coma, with renal failure,
or with skin infection), sepsis with and without local
bacterial infection in children, and respiratory distress
syndrome in neonates regardless of the gestational age
were all frequently hard to differentiate in the analysis.
The causes that were frequently confused with each
other were aggregated into a new cause in the final analysis
cause list. For example, all six maternal causes
were combined into one maternal category. After this
step, the final cause list for analysis had 34 causes for
adults, 21 for children, and 10 for neonates, excluding
stillbirths.

Table 4 Reduction in number of causes to the final
analysis cause list, excluding stillbirths

                        Adult    Child    Neonate
Target cause list          53       27         13
15 or more deaths          46       22         12
Cross classification       34       21         10
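
The two reduction steps can be sketched as a simple relabeling of the gold standard cause for each death. The mappings below are illustrative only; the actual assignments are those in Additional file 5, and the function name and toy counts are hypothetical.

```python
from collections import Counter


def shorten_cause_list(causes, residual_for, merge_into, min_deaths=15):
    """Relabel causes: rare causes go to a residual category, then
    frequently confused causes are merged into one aggregate cause."""
    counts = Counter(causes)
    shortened = []
    for cause in causes:
        # Step 1: causes with fewer than min_deaths deaths are reassigned.
        if counts[cause] < min_deaths:
            cause = residual_for[cause]
        # Step 2: causes that are hard to tell apart are aggregated.
        cause = merge_into.get(cause, cause)
        shortened.append(cause)
    return shortened


toy_causes = (
    ["Uterine cancer"] * 3
    + ["Breast cancer"] * 20
    + ["Diabetes with coma"] * 16
    + ["Diabetes with renal failure"] * 16
)
toy_result = Counter(
    shorten_cause_list(
        toy_causes,
        residual_for={"Uterine cancer": "Other cancers"},  # illustrative
        merge_into={
            "Diabetes with coma": "Diabetes",
            "Diabetes with renal failure": "Diabetes",
        },
    )
)
```

In this toy example four causes collapse to three: the rare cancer moves to the residual category and the two diabetes causes become one.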

Results 

Table 5 shows that of the 12,542 deaths collected as
gold standard cases for the study, the vast majority
(88%) were deaths that met the highest level of GS criteria
(level 1). This number varies from 84% in Bohol to
91% in Dar es Salaam; and by age, 86% of adult deaths
were level 1, 81% of child deaths, and 99.7% of neonate
deaths. The majority of the remaining 12% level 2
deaths were adults.

Table 5 Numbers of VAs collected by site and gold
standard level

Site              Adult             Child             Neonate             Total
                  Level 1  Level 2  Level 1  Level 2  Level 1  Level 2
Andhra Pradesh      1,285      269      385       66      376        1    2,382
Bohol                 998      262      234       30      374        0    1,898
Dar es Salaam       1,556      162      366      106    1,047        2    3,239
Mexico              1,373      215      124        4      313        2    2,031
Pemba Island          266       31      156      105      261        3      822
Uttar Pradesh       1,277      142      412       87      251        1    2,170
Total               6,755    1,081    1,677      398    2,622        9   12,542
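
The headline percentages can be recomputed directly from the counts in Table 5. The sketch below only reproduces that arithmetic; the variable and function names are illustrative, and the neonate columns include stillbirths, as in the table.

```python
# Counts from Table 5, per site:
# (adult L1, adult L2, child L1, child L2, neonate L1, neonate L2)
TABLE_5 = {
    "Andhra Pradesh": (1285, 269, 385, 66, 376, 1),
    "Bohol": (998, 262, 234, 30, 374, 0),
    "Dar es Salaam": (1556, 162, 366, 106, 1047, 2),
    "Mexico": (1373, 215, 124, 4, 313, 2),
    "Pemba Island": (266, 31, 156, 105, 261, 3),
    "Uttar Pradesh": (1277, 142, 412, 87, 251, 1),
}


def column_totals(table):
    """Sum each of the six count columns across sites."""
    return [sum(column) for column in zip(*table.values())]


def percent(part, whole):
    return 100.0 * part / whole


adult_l1, adult_l2, child_l1, child_l2, neo_l1, neo_l2 = column_totals(TABLE_5)
total_deaths = adult_l1 + adult_l2 + child_l1 + child_l2 + neo_l1 + neo_l2

level1_share = percent(adult_l1 + child_l1 + neo_l1, total_deaths)
adult_share = percent(adult_l1, adult_l1 + adult_l2)
child_share = percent(child_l1, child_l1 + child_l2)
neonate_share = percent(neo_l1, neo_l1 + neo_l2)
```

Rounded, these give 88% overall, 86% for adults, 81% for children, and 99.7% for neonates, matching the figures quoted in the text.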

It is interesting to note the cause distribution by qual- 
ity of the gold standards. Table 6 presents the break- 
down of how many level 1 and level 2 GS cases were 
collected for each of the 53 adult causes. Eighty-six percent
of adult deaths were level 1, 13% were level 2A,
and 1% were level 2B. Twenty-five causes of death,
which represent 47% of all adult causes, were exclusively
level 1. For the remaining 28 causes, the frequency of
level 1 deaths varies, such as cirrhosis and asthma with 
less than 30% level 1 cases; pneumonia and sepsis with 
between 30% and 60% level 1 cases; and stroke, lung 
and esophageal cancers, and tuberculosis with between 
60% and 75% level 1 cases. Table 7 shows the results for 
the 2,075 deaths in children. Eighteen causes of death, 
which comprise 67% of all of the child causes, reached 
the level 1 gold standard. Another six causes do not 
achieve more than 60% of gold standard level 1 and 
vary from 0% (measles) to more than 50% (malaria, 
pneumonia, and sepsis). Table 8 shows that the level of 
quality was very high for the 1,629 neonatal deaths and 
1,002 stillbirths. 

The distribution of cases (all criteria levels combined) 
across the six sites is shown in Additional file 13. The 
relative distribution of cases by age of death across sites 
reflects their overall progress with mortality transition. 
Thus, fewer adult deaths were collected in Pemba
than in the other sites, where 1,200 to 1,600 cases
were typically collected. Larger numbers of child deaths
were collected in Dar es Salaam and Uttar Pradesh, 
where child death rates are higher than elsewhere. Simi- 
lar numbers of neonatal deaths were collected in each 
site (250 to 400) except for Dar es Salaam. In this case,
the site collected VAs on a substantially higher number
of neonatal deaths (1,049) than was targeted, as the site
had the VA interviewer capacity to easily add these
cases as they were identified. For example, while the targeted
number of stillbirth deaths was 100, the Dar es
Salaam site was able to easily collect interviews on 432 
cases to help build a more robust dataset. 

Discussion 

PHMRC was able to obtain completed VA interviews for
more than 12,000 deaths with GS assignment of true
cause of death. Because of the poor quality of medical
record-keeping and limitations of diagnostic technology
in many hospitals, identifying more than 12,000 GS
deaths required reviewing and screening a much larger
number of records. While it was difficult in many sites
to obtain sufficient documentation for some causes of
death, overall across all six sites we were able to find
enough deaths for 46 adult causes, 22 child causes, and
12 neonate causes, excluding stillbirths, from the original
cause list. The implementation of the project
revealed just how poor the quality of medical records
and diagnosis is in some institutions. This finding reaffirms
our original hypothesis that convergent validity
between verbal autopsy and poorly assigned hospital
cause of death is not a measure of criterion validity.

Table 6 Numbers of VAs collected by cause of death and
gold standard level for adult causes

Adult causes                          Level 1  Level 2A  Level 2B
AIDS                                      345         0         8
AIDS with TB                              148         0         0
Acute myocardial infarction               376        24         0
Anemia                                     68         0         0
Asthma                                     13        34         0
Bite of venomous animal                    66         0         0
Breast cancer                             179         3        12
COPD                                      170         1         0
Cervical cancer                           127        23         5
Cirrhosis                                  82       231         0
Colorectal cancer                          85         6         8
Dementia                                    1         0         0
Diabetes with coma                        144         0         0
Diabetes with renal failure               156         0         0
Diabetes with skin infection/sepsis       114         0         0
Diarrhea/dysentery                        221         7         0
Drowning                                  106         0         0
Epilepsy                                   47         1         0
Esophageal cancer                          26        13         1
Falls                                     173         0         0
Fires                                     122         0         0
Hemorrhage                                111         3         0
Homicide                                  167         0         0
Hypertensive disorder                     107         6         0
Congestive heart failure                  221         0         0
Inflammatory heart disease                 42         0         0
Leukemia                                   71         2         5
Liver cancer                               29         0         2
Lung cancer                                66        36         4
Lymphomas                                  74         0         3
Malaria                                    89        11         0
Mouth/oropharynx cancer                    22         0         0
Obstructed labor                           17         1         0
Other cancers                             142         0         0
Other cardiovascular diseases             153         0         0
Other digestive diseases                  166         0         0
Other infectious diseases                 258         0         0
Other injuries                            103         0         0
Other noncommunicable diseases            200         0         0
Other pregnancy-related deaths             89         0         0
Ovarian cancer                             32         1         0
Pelvic inflammatory disease                 5         0         0
Pneumonia                                 310       229         0
Poisonings                                 86         0         0
Prostate cancer                            40         8         0
Renal failure                             411         2         0
Road traffic                              202         0         0
Sepsis                                     24        46         0
Stomach cancer                             50        10         2
Stroke                                    378       252         0
Suicide                                   124         0         0
TB                                        196        79         0
Uterine cancer                              1         1         1

An important potential limitation of the study is the
extent to which the cause of death based on fulfilling
the clinical, laboratory, medical imaging, and tissue
pathology criteria in this study is the true cause of
death. Studies in high-resource settings [43] suggest that
clinical diagnosis compared to postmortem autopsy may
differ in up to 25% of cases. These studies, however,
overstate the limitations of using clinical diagnostic
criteria in our study for three reasons. First, autopsies are
much more likely to be undertaken in medico-legal 
cases or cases with uncertain clinical diagnosis. Shojania 
et al. found that once the inherent selection bias of 
postmortem autopsy is taken into account, clinical diag- 
nosis and postmortem autopsy agree more than 90% of 
the time [44]. Second, these comparisons are for all clin- 
ical diagnoses, not for the subset that meets our clearly 
defined and stringent criteria. In general, less than one- 
third of hospital deaths in our study fulfilled our diag- 
nostic criteria even in the most sophisticated hospitals. 
It is a reasonable assumption that the concordance 
between the clinical diagnosis and postmortem autopsy 
would be even higher in the subset meeting our criteria. 
Finally, the definition in these studies of major diagnos- 
tic discrepancy is for clinical purposes, not for the pur- 
poses of assigning underlying cause of death. For the 
latter effort, some of the major discrepancies would not 
move deaths between cause of death categories used in 
this study. 

Some readers may object to the use of "gold standard"
in describing our dataset. We believe, however, that we
have implemented the best possible approach to
assigning causes of death. In nearly all settings, postmortem
rates are low and subject to severe selection
bias toward diagnostically challenging and nonrepresentative
deaths for a cause. For both implementation and
selection bias reasons, we do not foresee VA validation
studies being undertaken using large samples of deaths
with postmortem autopsies. Clearly defined clinical,
laboratory, imaging, and tissue pathology criteria as used
in this study are the best that can be implemented. As
such, we believe the use of the term gold standard for
this dataset is appropriate.

Table 7 Numbers of VAs collected by cause of death and
gold standard level for child causes

Child causes                                Level 1  Level 2A  Level 2B
AIDS                                             19         0         0
AIDS with TB                                      1         0         0
Bite of venomous animal                          54         0         0
Diarrhea/dysentery                              255         1         0
Drowning                                         82         1         0
Encephalitis                                     41         0         0
Falls                                            49         0         0
Fires                                            68         0         0
Hemorrhagic fever                                51         0         0
Malaria                                          59        58         0
Measles                                           0        23         0
Meningitis                                       58         0         0
Other cancers                                    28         0         0
Other cardiovascular diseases                    76         0         0
Other defined causes of child deaths            182         0         0
Other digestive diseases                         48         0         0
Other infectious diseases                        60         0         0
Other respiratory diseases                       12         0         0
Pertussis                                         0         0         0
Pneumonia                                       272       224         1
Pneumonia and diarrhea                           35         3         0
Poisonings                                       18         0         0
Road traffic                                     92         0         0
Sepsis (with local bacterial infection)          22        15         0
Sepsis (without local bacterial infection)       39        67         0
TB                                                4         5         0
Violent death                                    52         0         0

A particularly vexing issue in VA validation studies is
that by their nature they are conducted on deaths that
have occurred in hospital. What would be the performance
of VA for deaths in the community? There are
potentially three distinct aspects to this question. First,
the cause-composition of deaths in the hospital and the
community will be different. Fortunately, because we
create multiple test datasets with widely varying cause
compositions, this issue will not influence the results
from VA validation studies as long as the methods
recommended by Murray et al. [15] are followed. Second,
contact and experience with the health system
could change the way in which household members
recall certain symptoms or signs. If it does, then VA
may capture more information in those cases with hospital
experience than when implemented in a population
with little or no experience of health care. Given that all
validation studies require some diagnostic information
on the course of illness prior to death, no validation
study can ever investigate this question. This is an
unfortunate reality; we believe that constructing a dataset,
as we have done, that excludes all information from
the household about medical experience prior to death
is the closest we can come in a validation study to
understanding how VA will perform in a poor, underserved
community. While it is theoretically possible that
household recall of symptoms and signs will be different
if someone has experienced health care prior to death,
there is in fact no direct evidence for this hypothesis,
nor is it clear how it would be tested. Third, the clinical
course and thus the signs and symptoms related to a
cause of death may be influenced through contact with
the health system. As with the second limitation, there
is unfortunately no way to investigate this important
issue. We simply have no way to determine the true
cause of death for deaths that have occurred in the
community with no contact with health services.

Table 8 Numbers of VAs collected by cause of death and gold standard level for neonatal causes

Neonate causes                                                Level 1  Level 2A  Level 2B
Birth asphyxia                                                    461         0         0
Congenital malformation                                           250         0         0
Meningitis (serious infection)                                      6         0         0
Pneumonia (serious infection)                                      84         5         0
Preterm delivery (<33 weeks gestational age [GA]) without
  respiratory distress syndrome (RDS)                             353         0         0
Preterm delivery (with or without RDS) and sepsis                  75         1         0
Preterm delivery (without RDS) and birth asphyxia                  89         0         0
Preterm delivery (without RDS) and sepsis and birth asphyxia       34         0         0
Respiratory distress syndrome (33-36 weeks GA)                     13         0         0
Respiratory distress syndrome (<33 weeks GA)                       97         0         0
Sepsis (serious infection)                                        127         1         0
Sepsis with local bacterial infection                              32         1         0
Stillbirth                                                      1,001         1         0
Tetanus                                                             4         0         0

Ideally, all countries would have in place functioning 
vital registration systems that capture all deaths and 
include a medically certified cause of death according 
to the procedures and rules of the International Classi- 
fication of Diseases in force at the time. While pro- 
gress toward this goal is being made, it is painfully 
slow, and without greater government commitment, 
will not be a reality for most developing countries for 
decades to come [45,46]. To meet urgent policy and 
planning needs, countries will have no alternative but 
to introduce verbal autopsy, at least for deaths that 
occur outside hospitals. It is critically important that 
they have confidence in the VA methods they use, and 
that they understand the validation and performance 
characteristics of those methods. We believe that to do 
so, validity and comparative performance must be 
assessed against rigorous, standardized criteria that 
unambiguously identify the cause of death, and that 
are not influenced whatsoever by the quality, usually 
very poor, of medical records or the diagnostic biases 
of physicians who review them. Our study has com- 
piled the first ever dataset of gold standard cause of 
death assignments across six sites in four countries. It 
is unlikely that a comparable dataset on VA with true 
gold standard cause of death ascertainment will be col- 
lected in the near future, if for no other reason than 
the substantial cost and time investment. For quite 
some time, therefore, the PHMRC will be the largest 
and most rigorously collected VA validation set. We 
intend to make the dataset publicly available in the 
hope that it will serve as a resource for the broader 
VA scientific community interested in developing and 
testing new methods. For this reason, we plan to
release to the public an anonymized version of the
dataset once the primary set of analyses from the
investigators has been published.

One lesson learned from the complexity of converting
free text into dichotomous variables is that future VA
instruments may want to incorporate a series of checklist
questions based on the free text variables that
improve VA performance. Rather than free text, items
could be included such as "Did anyone tell you or do
you have any documentation mentioning acute myocardial
infarction, MI, ischemic heart disease, or coronary
heart disease?" These checklist items would be completed
by the interviewer after questioning the respondent
and examining the medical records and other
documentation available. In this way, the task of reading
free text and translating it through a dictionary would
be simplified and focused only where it is likely to
change the results.

Conclusion 

We have described the development and usefulness of
the largest, and perhaps the only, dataset with gold standard
cause of death assignment and matching verbal autopsies
for more than 12,000 deaths in four countries.
We expect that this will facilitate further development 
of verbal autopsy and perhaps other cause of death 
measurement approaches in countries with poor vital 
registration and certification practices. The utility of 
this dataset will undoubtedly improve if additional 
cases, in different populations, and for different dis- 
eases than those reported here, are added in future 
studies, provided the same protocols and standards 
are applied. In this way, confidence in the utility of 
verbal autopsy methods will increase and result in 
their wider application in countries to reduce ignor- 
ance about the comparative importance of leading 
causes of death. 

Additional file 13: Final analysis cause list and numbers of deaths
by site.